    Some methods for blindfolded record linkage

    BACKGROUND: The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data on which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality. Dusserre, Quantin, Bouzelat and colleagues have demonstrated that it is possible to use secure one-way hash transformations to carry out follow-up epidemiological studies without any party having to reveal identifying information about any of the subjects – a technique which we refer to as "blindfolded record linkage". A limitation of their method is that only exact comparisons of values are possible, although phonetic encoding of names and other strings can be used to allow for some types of typographical variation and data errors. METHODS: A method is described which permits the calculation of a general similarity measure, the n-gram score, without having to reveal the data being compared, albeit at some cost in computation and data communication. This method can be combined with public-key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage. RESULTS: The system described offers good protection against misdeeds or security failures by any one party, but remains vulnerable to collusion between, or simultaneous compromise of, two or more parties involved in the linkage operation. In order to reduce the likelihood of this, the use of last-minute allocation of tasks to substitutable servers is proposed. Proof-of-concept computer programmes written in the Python programming language are provided to illustrate the similarity comparison protocol. CONCLUSION: Although the protocols described in this paper are not unconditionally secure, they do suggest the feasibility, with the aid of modern cryptographic techniques and high-speed communication networks, of a general-purpose probabilistic record linkage system which permits record linkage studies to be carried out with negligible risk of invasion of personal privacy.
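
    As a rough illustration of the METHODS idea, the sketch below computes an n-gram (bigram) similarity score between two hashed value sets, assuming a keyed one-way hash (HMAC-SHA256) shared between the data custodians. It is a deliberately simplified, single-machine stand-in for the multi-party protocol the abstract describes: the helper names and the shared key are illustrative assumptions, and a real deployment would split these roles across separate parties.

    import hashlib
    import hmac

    SECRET_KEY = b"shared-secret"  # assumed to be pre-agreed between data custodians

    def bigrams(value):
        """Split a normalised string into its set of 2-grams."""
        v = value.lower().strip()
        return {v[i:i + 2] for i in range(len(v) - 1)}

    def hash_bigrams(value):
        """Replace each bigram by a keyed one-way hash, so raw text is never shared."""
        return {hmac.new(SECRET_KEY, g.encode(), hashlib.sha256).hexdigest()
                for g in bigrams(value)}

    def dice_score(a, b):
        """Dice coefficient 2|A&B| / (|A|+|B|); identical on hashed or raw bigrams."""
        if not a and not b:
            return 1.0
        return 2 * len(a & b) / (len(a) + len(b))

    # Each party hashes its own value locally; only hashed sets are compared.
    print(dice_score(hash_bigrams("Catherine"), hash_bigrams("Katherine")))  # 0.875

    Because the Dice coefficient depends only on set intersections and sizes, it yields the same score on hashed bigrams as on the raw strings, which is what makes an approximate comparison possible without revealing the values themselves.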

    Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk

    It is well recognised that data mining and statistical analysis pose a serious threat to privacy. This is true for financial, medical, criminal and marketing research. Numerous techniques have been proposed to protect privacy, including restriction and data modification. Recently proposed privacy models such as differential privacy and k-anonymity have received a lot of attention, and for the latter there are now several improvements of the original scheme, each removing some security shortcomings of the previous one. However, the challenge lies in evaluating and comparing the privacy provided by various techniques. In this paper we propose a novel entropy-based security measure that can be applied to any generalisation, restriction or data modification technique. We use our measure to empirically evaluate and compare a few popular methods, namely query restriction, sampling and noise addition.
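
    A minimal sketch of how an entropy-based disclosure-risk measure might look, assuming (as an illustration only, not the paper's formulation) that risk is the Shannon entropy lost when a release narrows a uniform prior over the population down to a smaller candidate set:

    import math

    def shannon_entropy(probabilities):
        """H(p) = -sum(p_i * log2(p_i)), ignoring zero-probability outcomes."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    def disclosure_risk(population_size, candidates):
        """Bits of identity information leaked when a release narrows a uniform
        prior over the population down to `candidates` indistinguishable records."""
        prior = shannon_entropy([1 / population_size] * population_size)
        posterior = shannon_entropy([1 / candidates] * candidates)
        return prior - posterior

    # Narrowing 10,000 people to a 5-person equivalence class leaks ~11 bits.
    print(disclosure_risk(10_000, 5))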

    Integral privacy

    This work was presented at the International Conference on Cryptology and Network Security (15th: 2016: Milan, Italy). When considering data provenance, some problems arise from the need to safely handle provenance-related functionality. If modifications have to be performed on a data set due to provenance-related requirements, e.g. removing data from a given user or source, this will affect not only the data itself but also all related models and aggregated information obtained from the data. This is especially aggravated when the data are protected using a privacy method (e.g. a masking method), since modifications to the data and the model can leak information originally protected by the privacy method. To be able to evaluate privacy-related problems in data provenance, we introduce the notion of integral privacy, by comparison with the well-known definition of differential privacy.
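
    One way to picture the idea, under the assumption that integral privacy is concerned with how many different datasets could have generated a released model: the toy sketch below counts the subsets of a dataset that produce the same released aggregate, so that a provenance-driven deletion is more likely to leave the release unchanged and reveal nothing. The "model" here (a rounded mean) is purely illustrative.

    from itertools import combinations

    def model(subset):
        """A toy 'released model': the mean of the subset, coarsened by rounding."""
        return round(sum(subset) / len(subset))

    def generators(data, released, k):
        """All k-record subsets of the data that generate the released model."""
        return [s for s in combinations(data, k) if model(s) == released]

    data = [18, 22, 25, 31, 35, 41, 44, 52]
    released = model(data[:4])  # the value actually published
    # The more subsets generate the value, the better the release tolerates
    # record removals without changing or leaking anything.
    print(len(generators(data, released, 4)), "subsets generate the released value")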

    Escaping the Big Brother: an empirical study on factors influencing identification and information leakage on the Web

    This paper presents a study on factors that may increase the risk of personal information leakage due to the possibility of connecting user profiles that are not explicitly linked together. First, we introduce a technique for user identification based on cross-site checking and linking of user attributes. Then, we describe the experimental evaluation of the identification technique both in a real setting and on an online sample, showing its accuracy in discovering unknown personal data. Finally, we combine the results on the accuracy of identification with the results of a questionnaire completed by the same subjects who performed the test in the real setting. The aim of the study was to discover the factors that make users vulnerable to such techniques. We found that the number of social networks used, their features, and especially the number of profiles abandoned and forgotten by the user are factors that increase the likelihood of identification and the privacy risks.
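
    As a plausible stand-in for the cross-site checking technique (not the paper's exact implementation), the sketch below links profiles from two sites whenever their self-declared attribute values overlap sufficiently; the profile data, attribute keys and threshold are all illustrative assumptions.

    def jaccard(a, b):
        """Overlap of two attribute-value sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def link_profiles(profiles_a, profiles_b, threshold=0.5):
        """Propose links between profiles on two sites whose attributes overlap."""
        links = []
        for id_a, attrs_a in profiles_a.items():
            values_a = set(attrs_a.values())
            for id_b, attrs_b in profiles_b.items():
                score = jaccard(values_a, set(attrs_b.values()))
                if score >= threshold:
                    links.append((id_a, id_b, score))
        return links

    site_a = {"u1": {"name": "j.doe", "city": "Turin", "job": "nurse"}}
    site_b = {"x9": {"nick": "j.doe", "location": "Turin", "work": "teacher"},
              "x3": {"nick": "k.roe", "location": "Oslo", "work": "nurse"}}
    print(link_profiles(site_a, site_b))  # [('u1', 'x9', 0.5)]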

    Medical record linkage in health information systems by approximate string matching and clustering

    BACKGROUND: The multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity. METHODS: The proposed method proceeds in three steps. The first is to standardise and index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar record pairs, relying on a global similarity value derived from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when a new identity is inserted.
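
    The sketch below walks through the three steps on toy data, with difflib's SequenceMatcher ratio standing in for the Porter-Jaro-Winkler similarity and union-find standing in for the graph-based clustering; the field choices, blocking key and threshold are illustrative assumptions, not the paper's settings.

    from collections import defaultdict
    from difflib import SequenceMatcher

    records = [
        (0, "DUPONT", "MARIE", "1970-03-12"),
        (1, "DUPOND", "MARIE", "1970-03-12"),
        (2, "MARTIN", "PAUL", "1961-07-30"),
    ]

    # Step 1: standardise and block (here: surname initial + birth year).
    blocks = defaultdict(list)
    for rec in records:
        blocks[(rec[1][0], rec[3][:4])].append(rec)

    # Step 2: compare pairs within a block using a global similarity value.
    def similarity(r1, r2):
        scores = [SequenceMatcher(None, a, b).ratio() for a, b in zip(r1[1:], r2[1:])]
        return sum(scores) / len(scores)

    # Step 3: cluster matched pairs with union-find (a simple stand-in for the
    # agglomerative and partitioning methods described above).
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if similarity(block[i], block[j]) >= 0.85:
                    parent[find(block[i][0])] = find(block[j][0])

    clusters = defaultdict(list)
    for rec in records:
        clusters[find(rec[0])].append(rec[0])
    print(sorted(clusters.values()))  # [[0, 1], [2]]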

    Technical challenges of providing record linkage services for research

    Background: Record linkage techniques are widely used to enable health researchers to gain event-based longitudinal information for entire populations. The task of record linkage is increasingly being undertaken by specialised linkage units (SLUs). In addition to the complexity of undertaking probabilistic record linkage, these units face additional technical challenges in providing record linkage ‘as a service’ for research. The extent of this functionality, and approaches to solving these issues, has had little focus in the record linkage literature. Few, if any, of the record linkage packages or systems currently used by SLUs include the full range of functions required. Methods: This paper identifies and discusses some of the functions that are required or undertaken by SLUs in the provision of record linkage services. These include managing routine, ongoing linkage; storing and handling changing data; handling different linkage scenarios; and accommodating ever-increasing datasets. Automated linkage processes are one way of ensuring consistency of results and scalability of service. Results: Alternative solutions to some of these challenges are presented. By maintaining a full history of links, and storing pairwise information, many of the challenges around handling ‘open’ records and providing automated managed extractions are solved. A number of these solutions were implemented as part of the development of the National Linkage System (NLS) by the Centre for Data Linkage (part of the Population Health Research Network) in Australia. Conclusions: The demand for, and complexity of, linkage services are growing. This presents a challenge to SLUs as they seek to service the varying needs of dozens of research projects annually. Linkage units need to be both flexible and scalable to meet this demand. It is hoped the solutions presented here can help mitigate these difficulties.
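
    A minimal sketch of the "full history of links, stored pairwise" idea from the Results, assuming a simple in-memory store; the class and method names are hypothetical and not taken from the NLS.

    from datetime import datetime, timezone

    class PairwiseLinkStore:
        """Keeps every link decision ever made, so past extractions can be
        replayed and 'open' records revised without re-running the linkage."""

        def __init__(self):
            self.history = []  # (timestamp, record_a, record_b, status)

        def add_link(self, a, b, status="linked"):
            self.history.append((datetime.now(timezone.utc), a, b, status))

        def current_links(self):
            """The latest decision per pair wins, so links can change over time."""
            latest = {}
            for ts, a, b, status in self.history:
                latest[frozenset((a, b))] = status
            return {pair for pair, status in latest.items() if status == "linked"}

    store = PairwiseLinkStore()
    store.add_link("hosp:123", "births:88")
    store.add_link("hosp:123", "births:88", status="unlinked")  # later correction
    print(store.current_links())  # set() - the correction superseded the link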

    Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates ranging from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm which allows linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure comparable to that obtained using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure higher than that obtained using calculated probabilities. Further, the threshold estimation yielded F-measure results only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
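
    The sketch below shows expectation-maximisation over binary field-agreement patterns, the classical way to estimate match (m) and non-match (u) probabilities in probabilistic linkage. With Bloom-filter-encoded fields, agreement would in practice be derived from a similarity score on the filters (e.g. a Dice coefficient over set bits); here the agreement patterns and the starting values are illustrative assumptions, not the paper's.

    from math import prod

    def em(patterns, n_fields, iters=50):
        p = 0.1                # initial guess at the proportion of true matches
        m = [0.9] * n_fields   # P(field agrees | record pair is a match)
        u = [0.1] * n_fields   # P(field agrees | record pair is a non-match)
        for _ in range(iters):
            # E-step: posterior probability that each pair is a match.
            w = []
            for g in patterns:
                pm = p * prod(m[i] if g[i] else 1 - m[i] for i in range(n_fields))
                pu = (1 - p) * prod(u[i] if g[i] else 1 - u[i] for i in range(n_fields))
                w.append(pm / (pm + pu))
            # M-step: re-estimate p, m and u from the expected assignments.
            total = sum(w)
            p = total / len(patterns)
            for i in range(n_fields):
                agree = [wk for wk, g in zip(w, patterns) if g[i]]
                m[i] = sum(agree) / total
                u[i] = (len(agree) - sum(agree)) / (len(patterns) - total)
        return p, m, u

    # Agreement patterns for (surname, first name, birth date) over compared pairs.
    pairs = [(1, 1, 1)] * 20 + [(1, 0, 1)] * 5 + [(1, 0, 0)] * 5 + [(0, 0, 0)] * 70
    p, m, u = em(pairs, 3)
    print(round(p, 2), [round(x, 2) for x in m], [round(x, 2) for x in u])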

    An efficient record linkage scheme using graphical analysis for identifier error detection

    Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
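
    A minimal sketch of the graphical intuition, assuming records are nodes joined whenever they share an identifier, with a conflict on another field flagging a likely identifier error; the field names, synthetic data and conflict rule are illustrative, not the scheme from the paper.

    from collections import defaultdict

    records = [
        {"id": "r1", "nhs": "943-476-5919", "surname": "SMITH"},
        {"id": "r2", "nhs": "943-476-5919", "surname": "SMITH"},
        {"id": "r3", "nhs": "943-476-5919", "surname": "JONES"},  # likely identifier error
    ]

    # Group records by shared identifier (the edges of the record graph).
    by_identifier = defaultdict(list)
    for rec in records:
        by_identifier[rec["nhs"]].append(rec)

    # An identifier shared by records that conflict on another field is suspect.
    for nhs, recs in by_identifier.items():
        surnames = {r["surname"] for r in recs}
        if len(surnames) > 1:
            print(f"identifier {nhs} links conflicting records: {sorted(surnames)}")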